PTMs and half-lives

To do: check against the literature that phosphorylation is the most abundant PTM.

Proteins with a short half-life

Proteins can have varying half-lives.

The plots below compare the distribution of half-lives reported in the literature with the distribution for the subset of those proteins that also appear in the dataset. No outliers have been removed at this stage.

Outliers

I want to remove the proteins with very high values of log10(counts_norm_abund_len).

Detecting the outliers:

[1] "Dimensions BEFORE removing the outliers"
[1] 181235     18
[1] "Dimensions AFTER removing the outliers"
[1] 180165     18
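The rule used to flag outliers is not stated here. As a minimal sketch, assuming the common 1.5 × IQR fence applied to the log10-transformed counts, the before/after check could look like the following. The helper `is_outlier_iqr`, the toy table and its values are all hypothetical and only illustrate the idea.

```r
# Hypothetical helper: flag values outside the k * IQR fences.
# The actual rule used in this analysis is not stated; this is an assumption.
is_outlier_iqr <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[[2]] - q[[1]])
  x < q[[1]] - fence | x > q[[2]] + fence
}

# Toy data standing in for the real table (values are made up).
toy <- data.frame(
  Uniprot_entry_name = paste0("P", 1:8, "_HUMAN"),
  counts_norm_abund_len = c(0.010, 0.012, 0.011, 0.013, 0.009, 0.010, 0.012, 5)
)
toy$log_counts <- log10(toy$counts_norm_abund_len)

print("Dimensions BEFORE removing the outliers")
print(dim(toy))

# Keep only the rows that are not flagged.
toy_clean <- toy[!is_outlier_iqr(toy$log_counts), ]

print("Dimensions AFTER removing the outliers")
print(dim(toy_clean))
```

Flagging on the log10 scale rather than the raw scale keeps the fences from being dominated by the few very large values.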

The outliers have been removed now. What is the resulting distribution?

Looking in more detail

What are the most modified proteins?

mod <- human_ptms_hl_short %>% 
  group_by(Uniprot_entry_name) %>% 
  summarise(sum = sum(counts_norm_abund_len)) %>% 
  arrange(desc(sum))

mod[1:5,]
# A tibble: 5 × 2
  Uniprot_entry_name   sum
  <chr>              <dbl>
1 EAPP_HUMAN         0.190
2 RASF3_HUMAN        0.178
3 MTBP_HUMAN         0.177
4 CH088_HUMAN        0.171
5 NUFP1_HUMAN        0.170

Looking at a particular half-life range:

#write.table(df$Uniprot_entry_name, file = '/Users/anastasialinchik/Downloads/proteins.tsv', row.names = FALSE, sep = "\t", quote = FALSE)

PTMs

PTMs of interest:

  • PTMs that control autophagy

    • phosphorylation

    • ubiquitination -> need to use the new dataset

    • acetylation

  • oxPTMs

    • you have a list of these

  • Methylation, e.g. of histones

  • K Acylation -> need to get this from this paper

  • AGEs as markers of ageing

Phosphorylation

This is already without outliers

  • Only the modification [21]Phospho is present here.

The dataset is split into two groups: phosphorylated proteins and all remaining proteins.

It is not necessary to include another density line with all of the proteins. You can just compare the two distributions.
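The split described above can be sketched in base R as follows. This assumes a long-format table with one row per protein/modification pair; the column name `modification` and the toy values are hypothetical, while `mean_hl_hours`, `mod_group` and the "-" label follow the output shown elsewhere in this document.

```r
# Toy long-format table: one row per protein/modification pair (made-up values).
ptms <- data.frame(
  Uniprot_entry_name = c("A_HUMAN", "A_HUMAN", "B_HUMAN", "C_HUMAN", "C_HUMAN"),
  modification = c("[21]Phospho", "[1]Acetyl", "[34]Methyl",
                   "[21]Phospho", "[34]Methyl"),
  mean_hl_hours = c(4.2, 4.2, 18.5, 9.1, 9.1)
)

# Proteins carrying at least one phospho site.
phospho_ids <- unique(ptms$Uniprot_entry_name[ptms$modification == "[21]Phospho"])

# Collapse to one row per protein, then label each protein by group.
proteins <- unique(ptms[, c("Uniprot_entry_name", "mean_hl_hours")])
proteins$mod_group <- ifelse(proteins$Uniprot_entry_name %in% phospho_ids,
                             "Phosphorylated", "-")

print(table(proteins$mod_group))
```

Collapsing to one row per protein before labelling matters: otherwise a protein with many modification sites would be counted several times in the comparison.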

Comparison

[1] "Non-modified"
[1] 252   4
[1] "Modified"
[1] 914   4

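Testing whether the half-lives differ significantly between the two groups with a rank-based test. With only two groups, the Kruskal-Wallis test named in the output corresponds to the Wilcoxon rank-sum test. Note that the group sizes are uneven. The p-value was adjusted using the formula p*sqrt(N/100), where N = n1 + n2.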


    Design-based KruskalWallis test

data:  mean_hl_hours ~ mod_group
t = 2.4341, df = 1164, p-value = 0.01508
alternative hypothesis: true difference in mean rank score is not equal to 0
sample estimates:
difference in mean rank score 
                   0.09386352 
[1] "adjusted p-value (Good's Bayes adjustment)"
gPhosphorylated 
     0.05149134 
[1] 1166    4
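The adjustment p*sqrt(N/100) described above can be written as a small function. The sketch below first re-applies it to the rounded phosphorylation p-value and group sizes printed above (giving about 0.0515), then runs a base-R `wilcox.test` on simulated half-lives as a stand-in. The function name `adjust_p` and the simulated data are hypothetical, and `wilcox.test` is not the design-based test that produced the output above.

```r
# Sample-size adjustment quoted in the text: p * sqrt(N / 100), with N = n1 + n2.
adjust_p <- function(p, n1, n2) {
  p * sqrt((n1 + n2) / 100)
}

# Re-apply the adjustment to the rounded values printed above.
p_adj <- adjust_p(0.01508, n1 = 252, n2 = 914)
print(p_adj)

# Stand-in rank test on simulated half-lives (base R, not the design-based version).
set.seed(1)
hl_non_modified <- rexp(252, rate = 1 / 12)
hl_modified <- rexp(914, rate = 1 / 10)
res <- wilcox.test(hl_modified, hl_non_modified)
print(adjust_p(res$p.value, 252, 914))
```

Because the factor sqrt(N/100) is greater than 1 whenever N > 100, the adjusted value can exceed 1, as in several results below. In that case it is best read as "not significant" rather than as a probability.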

Acetylation

  • Filtered by the [1]Acetyl modification.

[1] "Non-modified"
[1] 616   4
[1] "Modified"
[1] 550   4

    Design-based KruskalWallis test

data:  mean_hl_hours ~ mod_group
t = 3.5568, df = 1164, p-value = 0.0003905
alternative hypothesis: true difference in mean rank score is not equal to 0
sample estimates:
difference in mean rank score 
                    0.1638661 
[1] "Adjusted p-value"
gAcetylated 
0.001333409 
[1] 1166    4

Ubiquitination

Ubiquitination has the classification ‘Other’. Take the ubiquitinated proteins as one group. The second group is drawn from all proteins with PTMs: 890 proteins overlap, which leaves 289 proteins that are not ubiquitinated, carry other PTMs, and have a known half-life. These make up the second group.

[1] "Non-modified"
[1] 278   4
[1] "Modified"
[1] 898   4

    Design-based KruskalWallis test

data:  mean_hl_hours ~ mod_group
t = -2.3096, df = 1174, p-value = 0.02108
alternative hypothesis: true difference in mean rank score is not equal to 0
sample estimates:
difference in mean rank score 
                  -0.09965387 
[1] "Adjusted p-value"
gUbiquitinated 
    0.07230265 
[1] 1176    4

Methylation

  • Filtered by the [34]Methyl modification

Violin plots

[1] "Non-modified"
[1] 391   4
[1] "Modified"
[1] 775   4

    Design-based KruskalWallis test

data:  mean_hl_hours ~ mod_group
t = 0.90091, df = 1164, p-value = 0.3678
alternative hypothesis: true difference in mean rank score is not equal to 0
sample estimates:
difference in mean rank score 
                   0.04207529 
[1] "Adjusted p-value"
gMethylated 
   1.256003 
[1] 1166    4

oxPTMs

This is only for proteins that are related to ageing.

All PTMs related to oxidative damage in general, not only oxidation.

[1] "Non-modified"
[1] 19  4
[1] "Modified"
[1] 1171    4

    Design-based KruskalWallis test

data:  mean_hl_hours ~ mod_group
t = 1.5091, df = 1188, p-value = 0.1315
alternative hypothesis: true difference in mean rank score is not equal to 0
sample estimates:
difference in mean rank score 
                    0.1974355 
[1] "Adjusted p-value"
  goxPTMs 
0.4537599 
[1] 1190    4

Lysine acylations

Violin plot

[1] "Non-modified"
[1] 707   4
[1] "Modified"
[1] 465   4

    Design-based KruskalWallis test

data:  mean_hl_hours ~ mod_group
t = 0.49822, df = 1170, p-value = 0.6184
alternative hypothesis: true difference in mean rank score is not equal to 0
sample estimates:
difference in mean rank score 
                   0.03815802 
[1] "Adjusted p-value"
gK acylation 
    2.117143 
[1] 1172    4

AGEs

Violin plots

[1] "Non-modified"
[1] 902   4
[1] "Modified"
[1] 269   4

    Design-based KruskalWallis test

data:  mean_hl_hours ~ mod_group
t = -1.0866, df = 1169, p-value = 0.2774
alternative hypothesis: true difference in mean rank score is not equal to 0
sample estimates:
difference in mean rank score 
                  -0.07799565 
[1] "Adjusted p-value"
    gAGEs 
0.9494111 
[1] 1171    4

Binning

Hypothesis: The longer the half-life, the greater the number of PTMs.

Phosphorylation

oxPTMs

methylation

Ubiquitination, acetylation, lysine acylation, AGEs

General:

Broken down by the modifications

Check the number of proteins in each bin.

# A tibble: 5 × 2
  hl_group protein_count
  <chr>            <int>
1 0-5                234
2 10-15              256
3 15-20              182
4 20+                286
5 5-10               226
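In the table above the bins are sorted alphabetically because `hl_group` is a character column, so "5-10" comes last. A minimal sketch of binning into an ordered factor with base R `cut()` is given below. The exact bin boundaries (left-closed intervals, in hours) are an assumption, and `bin_half_life` is a hypothetical helper.

```r
# Assumed bin edges in hours: [0,5), [5,10), [10,15), [15,20), [20, Inf).
bin_half_life <- function(hl_hours) {
  cut(hl_hours,
      breaks = c(0, 5, 10, 15, 20, Inf),
      labels = c("0-5", "5-10", "10-15", "15-20", "20+"),
      right = FALSE)
}

# Made-up half-lives, including values that sit exactly on a boundary.
toy_hl <- c(1.5, 4.9, 5, 7.2, 12, 19.9, 20, 48)
hl_group <- bin_half_life(toy_hl)

# Factor levels keep the bins in numeric order in tables and plots.
print(table(hl_group))
```

Storing the bins as a factor also fixes the ordering of the x-axis in any plot grouped by `hl_group`.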

oxPTMs + phospho

`summarise()` has grouped output by 'hl_group'. You can override using the
`.groups` argument.
# A tibble: 15 × 3
# Groups:   hl_group [5]
   hl_group mod_group      protein_count
   <chr>    <chr>                  <int>
 1 0-5      -                        233
 2 0-5      Phosphorylated           165
 3 0-5      oxPTMs                   227
 4 10-15    -                        253
 5 10-15    Phosphorylated           207
 6 10-15    oxPTMs                   254
 7 15-20    -                        181
 8 15-20    Phosphorylated           146
 9 15-20    oxPTMs                   184
10 20+      -                        285
11 20+      Phosphorylated           235
12 20+      oxPTMs                   285
13 5-10     -                        220
14 5-10     Phosphorylated           161
15 5-10     oxPTMs                   221

Proteins with a long half-life

Long-lived proteins can be used as estimators of chronological age. They can be defined in different ways, for example by comparing a protein's half-life with the average half-life of proteins in the organism. In this case, long-lived proteins were obtained from the following study: paper. That study classified proteins as long-lived based on their degree of degradation during the experiment, so new long-lived proteins could be discovered without a priori assumptions.

The study identified a list of long-lived proteins in rats, so the human orthologs of these proteins were used here.

Plot the data distributions

Outliers

[1] "Dimensions BEFORE removing the outliers"
[1] 2228928      18
[1] "Dimensions AFTER removing the outliers"
[1] 2131902      18

All of the outliers have been removed.

Check the distribution of the half-lives:

The proteins with very large half-lives also need to be removed. Proteins with short half-lives were not removed, even if they were identified as outliers:

[1] "Dimensions BEFORE removing the outliers"
[1] 2131902      18
[1] "Dimensions AFTER removing the outliers"
[1] 2063092      18
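Removing only the very large half-lives amounts to a one-sided filter. A minimal sketch, assuming an upper fence of Q3 + 1.5 × IQR on `mean_hl_hours`, is shown below. The fence rule, the helper `remove_upper_outliers` and the toy values are hypothetical, since the rule actually used is not stated.

```r
# Hypothetical one-sided filter: drop only values above Q3 + k * IQR.
# Low values are kept even if a two-sided rule would flag them.
remove_upper_outliers <- function(df, column, k = 1.5) {
  x <- df[[column]]
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  upper <- q[[2]] + k * (q[[2]] - q[[1]])
  df[x <= upper, , drop = FALSE]
}

# Made-up half-lives: one very short (kept) and one very long (removed).
toy_long <- data.frame(
  Uniprot_entry_name = paste0("Q", 1:7, "_HUMAN"),
  mean_hl_hours = c(0.1, 40, 42, 45, 47, 50, 900)
)

toy_trimmed <- remove_upper_outliers(toy_long, "mean_hl_hours")
print(dim(toy_trimmed))
```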

Now the exact same thing but for `human_complete_hl_long`

(Note that the dataset with the original data still has the proteins with very large half-lives so the scale needs to be based on the subset, not the original set)

Looking in more detail

What are the most modified proteins?

mod <- human_ptms_hl_short %>% 
  group_by(Uniprot_entry_name) %>% 
  summarise(sum = sum(counts_norm_abund_len)) %>% 
  arrange(desc(sum))

mod[1:5,]
# A tibble: 5 × 2
  Uniprot_entry_name   sum
  <chr>              <dbl>
1 EAPP_HUMAN         0.190
2 RASF3_HUMAN        0.178
3 MTBP_HUMAN         0.177
4 CH088_HUMAN        0.171
5 NUFP1_HUMAN        0.170

Looking at a particular half-life range:

#write.table(df$Uniprot_entry_name, file = '/Users/anastasialinchik/Downloads/proteins.csv', sep = ",", row.names = FALSE, quote = FALSE)

PTMs

Phosphorylation

Violin plot:

[1] "Non-modified"
[1] 157   4
[1] "Modified"
[1] 2336    4

    Design-based KruskalWallis test

data:  mean_hl_hours ~ mod_group
t = -0.90735, df = 2491, p-value = 0.3643
alternative hypothesis: true difference in mean rank score is not equal to 0
sample estimates:
difference in mean rank score 
                  -0.03385422 
[1] "Adjusted p-value"
gPhosphorylated 
       1.818986 
[1] 2493    4

Acetylation

Violin plot:

[1] "Non-modified"
[1] 263   4
[1] "Modified"
[1] 2230    4

    Design-based KruskalWallis test

data:  mean_hl_hours ~ mod_group
t = 5.7916, df = 2491, p-value = 7.849e-09
alternative hypothesis: true difference in mean rank score is not equal to 0
sample estimates:
difference in mean rank score 
                    0.2002254 
[1] "Adjusted p-value"
 gAcetylated 
3.918759e-08 
[1] 2493    4

Ubiquitination

Violin plot

[1] "Non-modified"
[1] 143   4
[1] "Modified"
[1] 2412    4

    Design-based KruskalWallis test

data:  mean_hl_hours ~ mod_group
t = 0.36469, df = 2553, p-value = 0.7154
alternative hypothesis: true difference in mean rank score is not equal to 0
sample estimates:
difference in mean rank score 
                   0.01750828 
[1] "Adjusted p-value"
gUbiquitinated 
      3.615996 
[1] 2555    4

Methylation

[1] "Non-modified"
[1] 77  4
[1] "Modified"
[1] 2416    4

    Design-based KruskalWallis test

data:  mean_hl_hours ~ mod_group
t = 3.0915, df = 2491, p-value = 0.002014
alternative hypothesis: true difference in mean rank score is not equal to 0
sample estimates:
difference in mean rank score 
                    0.1590465 
[1] "Adjusted p-value"
gMethylated 
  0.0100538 
[1] 2493    4

oxPTMs

Violin plot:

[1] "Non-modified"
[1] 2493    4
[1] "Modified"
[1] 2644    4

Lysine acylations

[1] "Non-modified"
[1] 257   4
[1] "Modified"
[1] 2369    4

    Design-based KruskalWallis test

data:  mean_hl_hours ~ mod_group
t = 4.7064, df = 2624, p-value = 2.652e-06
alternative hypothesis: true difference in mean rank score is not equal to 0
sample estimates:
difference in mean rank score 
                    0.1983194 
[1] "Adjusted p-value"
gK acylation 
1.358869e-05 
[1] 2626    4

AGEs

Violin plots:

[1] "Non-modified"
[1] 599   4
[1] "Modified"
[1] 2006    4

    Design-based KruskalWallis test

data:  mean_hl_hours ~ mod_group
t = 4.4658, df = 2603, p-value = 8.315e-06
alternative hypothesis: true difference in mean rank score is not equal to 0
sample estimates:
difference in mean rank score 
                    0.1279503 
[1] "Adjusted p-value"
       gAGEs 
4.244034e-05 
[1] 2605    4

Binning

oxPTMs: